# The following libraries would be used for understanding the dataset and performing analysis through
#graphs,visualization etc.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#Load data into dataframe
customer_support_data = pd.read_csv('Dataset_aftersales.csv')
customer_support_data.head()
#size of dataframe in rows and cols
customer_support_data.shape
We can see that the dataset has 100000 rows and 19 columns or features. We now further explore this dataset by checking for datatypes, missing values, nulls etc.
#Descriptive stats on numeric data
customer_support_data.describe()
The above output provides some statistical results for the numeric variables. We can guess the presence of outliers in part count as the range of values is very high, between 0 to 51, and it doesn't seem quite a possibility of having 51 parts replaced in a product. The other column statistics are also shown with the min, max, avg , sd and quartile values.
#info() function provides information on number of columns, non null values and datatypes
customer_support_data.info()
The above output shows all the feature variables and their datatypes and number of non-null values. Looking at the results, we can conclude that some columns have null values. Our next step would be to explore these values.
#Check for duplicate rows
customer_support_data.duplicated().value_counts()
#Drop duplicate rows
customer_support_data.drop_duplicates(inplace=True)
#after dropping duplicate rows:
customer_support_data.shape
#Null value count
customer_support_data.isnull().sum().sort_values(ascending=False)
We see that there are multiple columns having null values. We can now calculate the null values percentage in the dataset.
#Heatmap for nulls
sns.heatmap(customer_support_data.isnull(), yticklabels=False)
#Percentage of missing values
(customer_support_data.isnull().sum()/len(customer_support_data)).sort_values(ascending=False)
We see above that nearly 87% of repeat_parts_sent and 45% of the parts_sent column data is missing. The percentage of missing data in product_type, agent_tenure_indays, contact_type and region is very low. Hence we can eliminate these rows as their absence would not impact the overall data analysis.
Missing data in the columns country, topic_category and issue_type cannot be eliminated as they constitute more than 10% of the actual data. As these are categorical fields, we cannot fill in the missing data using any one value, since the existing data is varied. Hence the best way to deal with them is to fill it with 'Unknown'.
In the further steps we will deal with the null values step by step:
customer_support_data.dropna(subset=['product_type'],axis=0, inplace=True)
len(customer_support_data.product_type) #To check if the rows have been dropped
customer_support_data.dropna(subset=['agent_tenure_indays'], axis=0, inplace=True)
len(customer_support_data.agent_tenure_indays)
customer_support_data.dropna(subset=['contact_type'], axis=0, inplace=True)
len(customer_support_data.contact_type)
customer_support_data.dropna(subset=['region'],axis=0, inplace=True)
len(customer_support_data.region)
Now we handle the missing values in columns issue_type, topic_category and country. As the values for these columns cannot be filled using any one particular value and since a significant amount of data is missing in each of these rows, the best approach would be to replace NaNs with 'Unknown'.
#Nulls in topic category
customer_support_data.topic_category.fillna('Unknown', inplace=True)
customer_support_data.topic_category.unique()
#Nulls in issue type
customer_support_data.issue_type.fillna('Unknown', inplace=True)
customer_support_data.issue_type.unique()
#Nulls in Country
customer_support_data.country.fillna('Unknown', inplace=True)
customer_support_data.country.unique()
Thus, all the nulls have been removed and now we are left with parts_sent and repair_parts_sent columns. To fill in null values in parts_sent column, we first need to create a new column using custom binary encoding and group by using the parts_ct to check if parts_ct is 0 when the parts_sent is 0.
#Handling null values for parts_sent
customer_support_data['parts_sent_encoded']=np.where(customer_support_data['parts_sent'].isnull(),0,1)
customer_support_data.groupby('parts_sent_encoded')['parts_ct'].apply(lambda x: list(np.unique(x)))
We create a dummy variable 'parts_sent_encoded' for custom binary encoding where all nulls are representated as 0 and the non-nulls are representated as 1. Looking at the above result, we infer that the when parts_sent is null, parts_ct is 0. This means that no part was sent when part_ct is 0. Thus, we can directly substitute the null values in parts_sent with 'No part sent' as inferred from above.
#replacing nulls in parts_sent with 'No part sent'
customer_support_data.parts_sent.fillna('No part sent', inplace=True)
As we don't have any supporting data for repeat_parts_sent column as parts_sent, we will fill the nulls with a value 'Unknown'.
#Nulls for column 'repeat_parts_sent'
customer_support_data.repeat_parts_sent.fillna('Unknown', inplace=True)
#All Nulls removed
customer_support_data.isnull().sum().sort_values(ascending=False)
customer_support_data.drop('parts_sent_encoded',axis=1,inplace=True)
We now check for unique values and inconsistencies in the data. We can also observe that the dataframe consists of both numerical and categorical columns. We can identify these columns for further analysis
numcols = customer_support_data.dtypes[customer_support_data.dtypes != "object"].index
print("Numerical columns: ", len(numcols))
catcols = customer_support_data.dtypes[customer_support_data.dtypes == "object"].index
print("Categorical columns: ", len(catcols))
#Numeric columns
numcols
#Categorical Columns
catcols
#Check for unique values in each column
customer_support_data.asst_id.unique()
customer_support_data.mnfture_wk.unique()
customer_support_data.agent_tenure_indays.unique()
customer_support_data.contact_manager_flg.unique()
customer_support_data.contact_type.unique()
Here we can see that 'Voice' and 'VOICE are the same but show inconsistency in the data format. We resolve this by replacing 'Voice with 'VOICE'
#Remove non -uniformity in contact_type
customer_support_data.contact_type.replace('Voice', 'VOICE', inplace=True)
customer_support_data.contact_type.unique()
customer_support_data.contact_wk.unique()
#Plot correlation matrix to understand interdependence between columns
plt.figure(figsize=(10,10))
sns.heatmap(customer_support_data.corr(), annot=True, fmt='0.2f')
plt.title("Correlation Matrix")
There is a highly positive correlation between manufacture week and contract start week. We can say that the contract won't start until the product is manufactured. Similarly contract start week and contract end week are correlated. The product warranty cannot end until it has started.
# Plot for No of requests by Product type
plt.figure(figsize=(10,10))
g=sns.countplot(customer_support_data.product_type)
for p in g.patches:
g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
ha='center', va='bottom',
color= 'black')
plt.title("Analysis of the number of requests for each product type")
Interpretation: The plot above shows that maximum number of requests (60000+) have come in for laptops. This is followed by Desktops and lastly by other Electronics. This data can be useful for ABC to manage the tech support team. ABC can hire more technicians skilled in Laptop repairs owing to higher demand.
#Plot for No of requests by repair type
g=sns.countplot(customer_support_data.repair_type)
plt.title("Analysis of repair type distribution")
for p in g.patches:
g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
ha='center', va='bottom',
color= 'black')
Interpretation: The above plot shows that on an average, 50000+ of the service requests/problems require hard repairs. This data is again useful for managing the tech support team based on the skills desired.
#Preferred mode of communication based on product type
plt.figure(figsize=(10,10))
g=sns.countplot('contact_type', hue='product_type', orient='h',data=customer_support_data)
plt.title("Analysis of preferred mode of communication based on product type")
for p in g.patches:
g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
ha='center', va='bottom',
color= 'black')
Interpretation: This plot tells us that the most preferred mode of communication is voice/calls. ABC can use this information to manage it's customer care as they now know that most requests will be coming through phone calls.
#Plot for no of requests by repair type filtered by product type
plt.figure(figsize=(10,5))
g=sns.countplot(customer_support_data.repair_type, hue='product_type', data=customer_support_data)
plt.title('Analysis of repair type required based on the product type')
for p in g.patches:
g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
ha='center', va='bottom',
color= 'black')
Interpretation: The above plot shows distribution of the repair type, filtered by product type. We can see that laptops repair requests are highest in both Hard and Soft repair type.
#region wise analysis of parts sent depending on product type
plt.figure(figsize=(10,10))
p1=sns.boxplot(y='region',x='parts_ct',hue='product_type', data=customer_support_data)
p1.set(xlim=(0,10))
plt.title("BoxPlot for analyzing parts_sent count across regions for each product type")
Interpretation: The above box plot shows us the distribution of part count in various regions, grouped by product type. We can see the part count sent for laptops is visibly the highest for all 3 regions with minimum value being 0 and median value being 1.
For Hogwarts and Middle Earth Region: The maximum part count is 5, 2 and 2 for Laptop, Desktop and other electronics respectively. The quartiles indicate that the range of part count is more widespread for Laptops, ranging from 0 to 5.
We can see the presence of outliers. This corresponds to a situation when the required part count exceeds 5 in case of laptops and exceeds 2 in the case of Desktops and other electronics.
For the Milky way Region:
The maximum part count is 7, 5 and 2 for Laptop, Desktop and other electronics respectively. The quartiles indicate that the range of part count is more widespread for Laptops, ranging from 0 to 7.
We can also see the presence of outliers. This corresponds to a situation when the required part count exceeds 7 in case of laptops and exceeds 5 in the case of Desktops and other electronics.
These outliers seem natural as there could be a situation when a repair may require more than 5 to 7 parts to be replaced/repaired. However, this situation is rare and hence can be rightly looked upon as an outlier.
ABC can utilize this information to manage the demand and supply of the parts that frequently need to be repaired depending upon the product or the issue faced. ABC's Research and Development team can use this information to eliminate the faults with these parts, coming up with newer versions that give them a competitive edge in the market.
#Distribution of products serviced in and out of warranty
plt.figure(figsize=(10,5))
g=sns.countplot(customer_support_data.contact_wk>customer_support_data.contract_end, hue= customer_support_data.product_type)
plt.title("Distribution of products serviced in and out of warranty")
for p in g.patches:
g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
ha='center', va='bottom',
color= 'black')
Interpretation: This plot tells us which how many service requests came in when the product was in and out of warranty. 'False' denotes products that were in warranty and the True denotes products serviced out of warranty. This analysis is helpful to understand what kind products would face an issue within a few months of manufacture as well as after the expiry of warranty. Here we see that about 18% of the requests came in for products out of warranty and roughly, 66% of these requests came in for Laptops.
ABC can map these out of warranty requests with product issues to understand which issues are likely to occur within a few months of manufacture and which of them can occur after the warranty has expired. This can be beneficial in setting up the warranty plans of the product, addressing commonly seen issues with permanent solutions and providing enhanced customer service. User documentation can be created for commonly occurring issues that customers may resolve themselves.
#Plot for no of requests by warranty, filtered by topic_category
plt.figure(figsize=(10,20))
list=['Unknown']
customer_support_data1 =customer_support_data[~customer_support_data.topic_category.isin(list)]
sns.countplot(customer_support_data1.contact_wk>customer_support_data1.contract_end, hue= customer_support_data1.topic_category)
plt.title("Distribution of problems seen in and out of warranty")
plt.legend(loc='upper right')
Interpretation: In this plot we can see the problems reported by tech service agents, categorized as problems occurred when product is in warranty vs problems occurred when product is out of warranty. We can see in the graph that issues related to booting, OS, hard drive, touchpad, battery, app/software or physical damage are seen more commonly when the product is out of warranty.
The ABC team can design their service request forms such that there is a drop down to select the Problem category so that this data is available easily. Help guides and manuals can be created for the commonly encountered problems. Service requests can be automated, and a seperate team can be created to provide expedite service for addressing known issues, reducing waiting time of the customers.
#Analysis of the involvement of the manager for a repair type
plt.figure(figsize=(10,5))
sns.catplot(x='repair_type', y='contact_manager_flg', hue='product_type', data=customer_support_data, jitter=False)
plt.tight_layout()
plt.title("Analysis of the involvement of the manager for a repair type")
Interpretation: The above plot can be used to analyze the severity/complexity of an issue based on whether a manager's involvement was necessary for resolving it. We can see the plot shows the involvement of contact manager for both hard and soft repair type, categorized by product type. We see that for laptops, for both hard and soft repairs, manager involvement was seen, while for Desktops manager involvement was required for performing hard repairs.
However, there may be other factors that contribute to this decision rather than just the product type or the repair type. This information could be useful in laying out the request resolution plan depending on the severity of the issue. Manager involvement may also mean that some issues require more access privileges which a technician may not have.
#No of requests for each topic category grouped by repair type
plt.figure(figsize=(100,100))
list=['Unknown']
customer_support_data1 =customer_support_data[~customer_support_data.topic_category.isin(list)]
g=sns.countplot(customer_support_data1.topic_category, hue='repair_type', data=customer_support_data1)
g.set_xticklabels(g.get_xticklabels(), rotation=45, ha="right", size=50)
plt.tight_layout()
plt.legend(loc='upper right')
for p in g.patches:
g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()),
ha='center', va='bottom',
color= 'black')
Interpretation: The above graph shows number of requests for each topic category, grouped by repair type. We can see that overall, the hard repair type for various topics is more prominent than the soft repair type. The requests are higher in number for booting, general queries, OS, Hard drive and System performance issues. Maximum hard repair type requests are seen for booting and LCD/Monitor issues.
#Analysis of no of parts sent for each issue, the hue denotes product type
plt.figure(figsize=(200, 100))
list=['Unknown']
customer_support_data1 =customer_support_data[~customer_support_data.issue_type.isin(list)]
p2=sns.catplot(x='issue_type',y='parts_ct', hue='product_type', data=customer_support_data1, height=4, aspect=4)
p2.set_xticklabels(rotation=90, ha="right")
#plt.tight_layout()
p2.set(ylim=(0,10))
plt.legend(loc='upper right')
Interpretation: In this plot, we see the number of parts that were sent for each issue_type. The hue parameter differentiates the part count by product type. Overall we can see that for Laptops, the part count ranges between 0 to 10 and a higher count can be seen for issues like HardDrive, Battery, Controller Cards, Optical drive, mechanical chassis, etc. The part count seems to be less for Desktop issues. This could indicate that a large number of Desktop issues can be handled without replacement of the parts.
#No of requests for each product type grouped by topic category
plt.figure(figsize=(20,20))
list=['Unknown']
customer_support_data1 =customer_support_data[~customer_support_data.topic_category.isin(list)]
p4=sns.countplot('product_type', hue='topic_category', data=customer_support_data1)
p4.set_xticklabels(p4.get_xticklabels(), rotation=45, ha="right", size=15)
plt.tight_layout()
plt.legend(loc='upper right')
Interpretation: The plot above shows the number of requests for each product, categorized by the topic_category. We can again infer that overall, maximum requests have come in for laptops and major problems faced are 'Booting','OS', 'Hard drive', 'Touchpad', 'LCD/Monitor' and 'System Performance'.
#Analysis of Agent Tenure across Countries where asset is located
plt.figure(figsize=(30,30))
list=['Unknown']
customer_support_data1 =customer_support_data[~customer_support_data.country.isin(list)]
customer_support_data1.head()
g=sns.catplot(x='country', y='agent_tenure_indays', data=customer_support_data1, height=5, aspect=3)
g.set_xticklabels(rotation=45, ha="right")
plt.tight_layout()
plt.title("Analysis of Agent Tenure across Countries")
Interpretation: This plot provides shows the Agent tenure in days across countries where the asset is located. This plot is helpful in analyzing which country has the most experienced Agents. This information can help a great deal in analyzing the response time on the service request when it is handled by an experienced agent as opposed to being handled by a novice agent. ABC can make use of this analysis for delegating requests to agents depending on the severity of the issue and the agent's expertise, in order to reduce response time and provide quicker solutions
# Plot of No of requests by issue type, filtered by product type
plt.figure(figsize=(200, 200))
list=['Unknown']
customer_support_data1 =customer_support_data[~customer_support_data.issue_type.isin(list)]
p2=sns.countplot('issue_type', hue='product_type', data=customer_support_data1)
p2.set_xticklabels(p2.get_xticklabels(),rotation=45, ha="right")
plt.tight_layout()
plt.title("No of requests by issue type, filtered by product type")
plt.legend(loc='upper right')
Interpretation:This plot shows the number of requests that have come by for issue, categorized by product type. This information is very useful to understand the commonly occurring issues in each product.
ABC can increase it's productivity and customer satisfaction index by predefining solutions for frequently occurring issues, resulting in low response time.
Pandas Profiling tool is used for a quick Exploratory data analysis for obtaining some descriptive statistics. I have added this step for showing some statistical analysis, which might have been covered and used in the earlier exploratory data analysis. This profiling tool is useful for a quick data exploration. I used the tool on cleaned data, but it can also be used on raw data to obtain some quick statistical insight into the data. The aim behind adding this code snippet was to showcase the usage of pandas profiling tool.
pip install pandas-profiling
from pandas_profiling import ProfileReport
prof = ProfileReport(customer_support_data)
prof